feat(kas): enable topology aware routing on kube-apiserver ClusterIP service#8732

Open
rrp-bot wants to merge 3 commits into
openshift:main from
rrp-bot:kas-topology-aware-routing

rrp-bot commented Jun 12, 2026

Summary

Set service.kubernetes.io/topology-mode=Auto on the kube-apiserver ClusterIP service so that callers — KCM, scheduler, OAPI, oauth-apiserver, OLM, konnectivity-server, and all operators — are routed to a KAS pod in their own Availability Zone, avoiding cross-AZ data transfer charges.

Background

AWS charges $0.01/GB per direction for traffic crossing AZ boundaries ($0.02/GB round trip). In a HyperShift management cluster running many hosted control planes, every component calls kube-apiserver constantly via the ClusterIP service. Without Topology Aware Routing (TAR), OVN distributes these calls across all KAS pods regardless of AZ, so a large share of requests pays the cross-AZ charge: with one KAS pod in each of three zones and even load balancing, roughly two out of three calls land in a different zone from the caller.
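To make the cost reasoning concrete, here is a back-of-the-envelope sketch. Only the $0.01/GB per-direction rate comes from the description above. The zone count, the even spreading of calls across zones, and the monthly traffic volumes are illustrative assumptions, not measurements from any cluster.

```go
package main

import "fmt"

// perDirectionUSDPerGB is the cross-AZ rate quoted in the description above.
const perDirectionUSDPerGB = 0.01

// crossAZFraction returns the share of calls that leave the caller's zone,
// assuming one endpoint per zone and perfectly even load balancing.
func crossAZFraction(zones int) float64 {
	if zones <= 1 {
		return 0
	}
	return float64(zones-1) / float64(zones)
}

// monthlyCrossAZCostUSD bills request and response bytes separately, since
// each direction is charged at the per-direction rate.
func monthlyCrossAZCostUSD(requestGB, responseGB float64, zones int) float64 {
	return (requestGB + responseGB) * crossAZFraction(zones) * perDirectionUSDPerGB
}

func main() {
	// Hypothetical volumes: 6 TB of requests and 9 TB of responses per month.
	cost := monthlyCrossAZCostUSD(6000, 9000, 3)
	fmt.Printf("estimated cross-AZ cost without TAR: $%.2f/month\n", cost)
}
```

With TAR active and every caller served from its own zone, the cross-AZ fraction in this model drops to zero, which is the saving the PR is after.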

How TAR works

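TAR is a native Kubernetes feature (the service.kubernetes.io/topology-mode annotation was introduced in 1.27 while the feature was still beta; it reached GA in 1.33) that programs OVN/kube-proxy to prefer zone-local endpoints, based on the zone hints the EndpointSlice controller writes on each endpoint. The annotation goes on the Service being called, so all callers benefit automatically with no changes on their side. Endpoint filtering happens when the proxy rules are programmed, so no extra work is done per request.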
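The endpoint preference described above can be sketched roughly as follows. This is a simplified model based on kube-proxy's documented behavior (filter by zone hint, and skip filtering when hints are incomplete or nothing matches the local zone). The `endpoint` type and `selectEndpoints` function are hypothetical names for illustration, and OVN-Kubernetes may differ in detail.

```go
package main

import "fmt"

// endpoint is a hypothetical stand-in for an EndpointSlice endpoint:
// an address plus the zones it has been hinted to serve.
type endpoint struct {
	addr     string
	forZones []string
}

// selectEndpoints returns the endpoints a proxy running in nodeZone would use.
// It falls back to the full list when any endpoint lacks hints or when no
// endpoint is hinted for nodeZone.
func selectEndpoints(all []endpoint, nodeZone string) []endpoint {
	var local []endpoint
	for _, ep := range all {
		if len(ep.forZones) == 0 {
			return all // hints are incomplete, so skip filtering entirely
		}
		for _, z := range ep.forZones {
			if z == nodeZone {
				local = append(local, ep)
				break
			}
		}
	}
	if len(local) == 0 {
		return all // nothing hinted for this zone, so use every endpoint
	}
	return local
}

func main() {
	// Hypothetical KAS endpoints, one per zone.
	eps := []endpoint{
		{addr: "10.0.0.5", forZones: []string{"us-east-1a"}},
		{addr: "10.0.1.5", forZones: []string{"us-east-1b"}},
		{addr: "10.0.2.5", forZones: []string{"us-east-1c"}},
	}
	for _, ep := range selectEndpoints(eps, "us-east-1b") {
		fmt.Println("caller in us-east-1b uses", ep.addr)
	}
}
```

Because the filtering happens once, when the endpoint list changes, callers such as KCM or the scheduler need no awareness of zones at all.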

Why this is safe

  • Precondition already met: KAS pods are spread across zones via requiredDuringScheduling pod anti-affinity on topology.kubernetes.io/zone, guaranteeing ≥1 endpoint per zone in HA mode.
  • Non-HA / single-replica mode: TAR silently falls back to normal routing if the precondition is not met — safe to set unconditionally.
  • PrivateLink NLB unaffected: The LoadBalancer service used for PrivateLink already has cross-zone LB explicitly enabled (required for regions with 4+ AZs) and is not touched by this change.

Changes

  • kas/service.go: Add service.kubernetes.io/topology-mode=Auto to the ClusterIP path in ReconcileService and to ReconcileServiceClusterIP (Azure/KubeVirt path)
  • kas/service_test.go: Add TestReconcileServiceTopologyAwareRouting and TestReconcileServiceClusterIPTopologyAwareRouting
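As a rough illustration of the shape of this change and its tests, the sketch below sets the annotation on ClusterIP services only and checks it with a small table of cases. The `service` struct is a hypothetical stand-in for `corev1.Service`, and `enableTopologyAwareRouting` is an invented helper name; the real code lives in ReconcileService and ReconcileServiceClusterIP and may be structured differently.

```go
package main

import "fmt"

const topologyModeAnnotation = "service.kubernetes.io/topology-mode"

// service is a hypothetical stand-in for corev1.Service with only the
// fields this sketch needs.
type service struct {
	Type        string
	Annotations map[string]string
}

// enableTopologyAwareRouting marks a ClusterIP service for TAR and leaves
// every other service type untouched.
func enableTopologyAwareRouting(svc *service) {
	if svc.Type != "ClusterIP" {
		return
	}
	if svc.Annotations == nil {
		// Writing to a nil map panics in Go, so initialize it first.
		svc.Annotations = map[string]string{}
	}
	svc.Annotations[topologyModeAnnotation] = "Auto"
}

func main() {
	cases := []struct {
		name      string
		svc       service
		expectTAR bool
	}{
		{name: "ClusterIP service", svc: service{Type: "ClusterIP"}, expectTAR: true},
		{name: "LoadBalancer service", svc: service{Type: "LoadBalancer"}, expectTAR: false},
	}
	for _, tc := range cases {
		enableTopologyAwareRouting(&tc.svc)
		got := tc.svc.Annotations[topologyModeAnnotation] == "Auto"
		fmt.Printf("%s: annotation set=%v, expected=%v\n", tc.name, got, tc.expectTAR)
	}
}
```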

Verification

After deployment, confirm TAR is active:

kubectl get endpointslices -n <hcp-namespace> \
  -l kubernetes.io/service-name=kube-apiserver \
  -o jsonpath='{.items[0].endpoints[*].hints}'

Zone names in the hints field confirm that TAR is working.
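The same check can be automated. The sketch below assumes the EndpointSlice list has been fetched with `-o json` instead of the jsonpath expression, and decodes only the fields it needs (`items[].endpoints[].hints.forZones[].name`). The struct and function names are illustrative, and the sample payload in `main` is made up.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal mirrors of the EndpointSlice JSON fields used by this check.
type zoneHint struct {
	Name string `json:"name"`
}

type endpointHints struct {
	ForZones []zoneHint `json:"forZones"`
}

type sliceEndpoint struct {
	Addresses []string       `json:"addresses"`
	Hints     *endpointHints `json:"hints"`
}

type endpointSlice struct {
	Endpoints []sliceEndpoint `json:"endpoints"`
}

type endpointSliceList struct {
	Items []endpointSlice `json:"items"`
}

// allEndpointsHinted reports whether every endpoint in the list carries at
// least one zone hint. An empty list counts as "not hinted".
func allEndpointsHinted(raw []byte) (bool, error) {
	var list endpointSliceList
	if err := json.Unmarshal(raw, &list); err != nil {
		return false, err
	}
	seen := 0
	for _, s := range list.Items {
		for _, ep := range s.Endpoints {
			seen++
			if ep.Hints == nil || len(ep.Hints.ForZones) == 0 {
				return false, nil
			}
		}
	}
	return seen > 0, nil
}

func main() {
	// Made-up sample resembling `kubectl get endpointslices ... -o json`.
	sample := []byte(`{"items":[{"endpoints":[
		{"addresses":["10.0.0.5"],"hints":{"forZones":[{"name":"us-east-1a"}]}},
		{"addresses":["10.0.1.5"],"hints":{"forZones":[{"name":"us-east-1b"}]}}
	]}]}`)
	ok, err := allEndpointsHinted(sample)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println("all endpoints hinted:", ok)
}
```

Unlike the jsonpath command, which inspects only the first slice, this walks every slice in the list, so a Service whose endpoints span several EndpointSlices is covered too.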

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • KAS API services now enable topology-aware routing (topology-mode: Auto) for non-public ClusterIP services and when switching services to ClusterIP for Route publishing, improving request locality and reducing cross-AZ latency.
  • Tests

    • Added unit tests covering topology-aware routing behavior across publishing strategies and endpoint access options, including ClusterIP scenarios.

@openshift-merge-bot

Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

coderabbitai Bot commented Jun 12, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 038f3b41-e1cf-40a3-81b7-f7031b7e3bf0

📥 Commits

Reviewing files that changed from the base of the PR and between 01a55f3 and d5d9a11.

📒 Files selected for processing (1)
  • control-plane-operator/controllers/hostedcontrolplane/kas/service.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • control-plane-operator/controllers/hostedcontrolplane/kas/service.go

📝 Walkthrough

Walkthrough

This PR adds topology-aware routing to the Kubernetes API server (KAS) services by setting the service.kubernetes.io/topology-mode: "Auto" annotation on ClusterIP services. The annotation is configured in two functions: ReconcileService applies it when creating non-public services, and ReconcileServiceClusterIP applies it before finalizing the service configuration. The changes are validated by two new unit tests that confirm the annotation is set correctly across different service strategies (Route and AWS LoadBalancer) and endpoint access configurations.

🚥 Pre-merge checks | ✅ 11
✅ Passed checks (11 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: enabling topology-aware routing on the kube-apiserver ClusterIP service by setting the annotation. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | In service_test.go there are no Ginkgo-style titles (no It/Describe/Context/When) and no t.Run subtest titles; added tests use static Go test function names. |
| Test Structure And Quality | ✅ Passed | Added topology-mode tests use t.Run table/subtests with only in-memory Service structs; no cluster wait/Eventually/Consistently, BeforeEach/AfterEach, or NotTo-without-message patterns found in ser... |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR only sets kube-apiserver Service annotation topology-mode=Auto and adds unit tests; it does not introduce any pod scheduling constraints (affinity/spread/replicas/nodeSelector/PDB). |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PR #8732 changes only kas/service.go and kas/service_test.go, adding Go unit tests that assert service annotations/types; no hardcoded IPv4 or external/internet connectivity assumptions are present. |
| No-Weak-Crypto | ✅ Passed | In kas/service.go and service_test.go, no crypto imports or weak-algo strings (MD5/SHA1/DES/RC4/3DES/Blowfish/ECB) and no subtle/secret/token comparisons were found. |
| Container-Privileges | ✅ Passed | PR only updates kas/service.go and kas/service_test.go; no container/K8s manifest privilege settings (privileged, hostPID/hostNetwork/hostIPC, SYS_ADMIN, allowPrivilegeEscalation, root) found. |
| No-Sensitive-Data-In-Logs | ✅ Passed | Reviewed control-plane-operator/controllers/hostedcontrolplane/kas/service.go and service_test.go for log/printf/klog/t.Log usage and secret-like keywords (token/password/keys/PII); none present—on... |


openshift-ci Bot added the area/control-plane-operator (Indicates the PR includes changes for the control plane operator - in an OCP release) and needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.) labels and removed the do-not-merge/needs-area label on Jun 12, 2026

openshift-ci Bot commented Jun 12, 2026

Contributor

Hi @rrp-bot. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci Bot commented Jun 12, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rrp-bot
Once this PR has been reviewed and has the lgtm label, please assign bryan-cox for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci Bot requested review from clebs and devguyio on June 12, 2026, 18:03

coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
control-plane-operator/controllers/hostedcontrolplane/kas/service.go (1)

109-112: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Route strategy is missing the topology-aware routing annotation.

The Route strategy creates a ClusterIP service (line 111), but unlike the LoadBalancer+private case (lines 98-102), it does not set the service.kubernetes.io/topology-mode annotation. This causes TestReconcileServiceTopologyAwareRouting to fail for the "When Route strategy" test case (service_test.go lines 138-146), which expects the annotation to be present.

🔧 Proposed fix to add topology annotation for Route strategy
 case hyperv1.Route:
 	if hcp.Spec.Platform.Type != hyperv1.IBMCloudPlatform || svc.Spec.Type != corev1.ServiceTypeNodePort {
 		svc.Spec.Type = corev1.ServiceTypeClusterIP
+		// Enable topology aware routing so that callers (KCM, scheduler, operators, etc.)
+		// are routed to a KAS pod in their own AZ, avoiding cross-AZ data transfer charges.
+		// KAS pods are spread across zones via requiredDuringScheduling anti-affinity,
+		// satisfying the >=1 endpoint per zone precondition for TAR to activate.
+		// Guard against a nil map in case annotations were not initialized earlier.
+		if svc.Annotations == nil {
+			svc.Annotations = map[string]string{}
+		}
+		svc.Annotations["service.kubernetes.io/topology-mode"] = "Auto"
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@control-plane-operator/controllers/hostedcontrolplane/kas/service.go` around
lines 109 - 112, In the hyperv1.Route branch (case hyperv1.Route) where you set
svc.Spec.Type = corev1.ServiceTypeClusterIP, also add the topology-aware routing
annotation svc.ObjectMeta.Annotations["service.kubernetes.io/topology-mode"] =
"Auto" (same as the LoadBalancer+private branch) so the Route
strategy produces the topology-aware routing annotation expected by
TestReconcileServiceTopologyAwareRouting; ensure you reference svc and
hcp.Spec.Platform.Type in the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@control-plane-operator/controllers/hostedcontrolplane/kas/service_test.go`:
- Around line 138-146: The test fails because ReconcileService in service.go
does not set the "service.kubernetes.io/topology-mode" annotation for the
Route publishing strategy; update the Route branch in ReconcileService (the
logic handling hyperv1.ServicePublishingStrategy{Type: hyperv1.Route}) to add
the same topology-mode annotation used for the private LoadBalancer case (set to
"Auto") on the ClusterIP service object so the
expectTARAnnotation check in the test passes.

---

Outside diff comments:
In `@control-plane-operator/controllers/hostedcontrolplane/kas/service.go`:
- Around line 109-112: In the hyperv1.Route branch (case hyperv1.Route) where
you set svc.Spec.Type = corev1.ServiceTypeClusterIP, also add the topology-aware
routing annotation
svc.ObjectMeta.Annotations["service.kubernetes.io/topology-mode"] =
"Auto" (same as the LoadBalancer+private branch) so the Route
strategy produces the topology-aware routing annotation expected by
TestReconcileServiceTopologyAwareRouting; ensure you reference svc and
hcp.Spec.Platform.Type in the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: c64468be-1fe1-4e2d-bf83-32eabc0a4706

📥 Commits

Reviewing files that changed from the base of the PR and between 39f04eb and 01a55f3.

📒 Files selected for processing (2)
  • control-plane-operator/controllers/hostedcontrolplane/kas/service.go
  • control-plane-operator/controllers/hostedcontrolplane/kas/service_test.go

Comment on lines +138 to +146
	name: "When Route strategy it should set topology-mode annotation on ClusterIP service",
	hcp: &hyperv1.HostedControlPlane{
		Spec: hyperv1.HostedControlPlaneSpec{
			Platform: hyperv1.PlatformSpec{Type: hyperv1.AWSPlatform},
		},
	},
	strategy: hyperv1.ServicePublishingStrategy{Type: hyperv1.Route},
	expectTARAnnotation: true,
},

Contributor

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

This test case will fail due to incomplete implementation.

The test expects ReconcileService to set the topology-mode annotation for Route strategy, but the implementation in service.go does not add the annotation for the Route case (only for LoadBalancer+private). This test case will fail until the Route case is updated to add the annotation.

See the comment on service.go lines 109-112 for the required fix.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@control-plane-operator/controllers/hostedcontrolplane/kas/service_test.go`
around lines 138 - 146, The test fails because ReconcileService in service.go
does not set the "service.kubernetes.io/topology-mode" annotation for the
Route publishing strategy; update the Route branch in ReconcileService (the
logic handling hyperv1.ServicePublishingStrategy{Type: hyperv1.Route}) to add
the same topology-mode annotation used for the private LoadBalancer case (set to
"Auto") on the ClusterIP service object so the
expectTARAnnotation check in the test passes.
